High-dimensional quantile mediation analysis with application to a birth cohort study of mother–newborn pairs

Abstract Motivation There has been substantial recent interest in developing methodology for high-dimensional mediation analysis. Yet, the majority of mediation statistical methods lean heavily on mean regression, which limits their ability to fully capture the complex mediating effects across the outcome distribution. To bridge this gap, we propose a novel approach for selecting and testing mediators throughout the full range of the outcome distribution spectrum. Results The proposed high-dimensional quantile mediation model provides a comprehensive insight into how potential mediators impact outcomes via their mediation pathways. This method’s efficacy is demonstrated through extensive simulations. The study presents a real-world data application examining the mediating effects of DNA methylation on the relationship between maternal smoking and offspring birthweight. Availability and implementation Our method offers a publicly available and user-friendly function qHIMA(), which can be accessed through the R package HIMA at https://CRAN.R-project.org/package=HIMA.


Introduction
Mediation analysis is an important statistical technique, elucidating the mechanisms of mediators that bridge the connection between an independent variable (e.g.exposure or treatment) and a dependent variable (e.g.health outcome).With the advancement of data collection technology, highdimensional data have become increasingly prevalent in various fields.As a result, high-dimensional mediation analysis has piqued the interest of a plethora of researchers recently.For example, Zhang et al. (2016) proposed a novel method to estimate and test high-dimensional mediation effects in epigenetic studies.Methods to estimate and test mediation effects with high-dimensional compositional microbiome data have been proposed by researchers like Sohn and Li (2019), Wang et al. (2020), Zhang et al. (2021a), and Zhang et al. (2021b).Djordjilovi� c et al. (2019) and Fang et al. (2021) studied the group testing for high-dimensional mediation effects.Luo et al. (2020) and Zhang et al. (2021c) embarked on exploring high-dimensional mediation analysis specifically for the survival outcome.Additionally, Dai et al. (2022) and Liu et al. (2022) made significant strides in hypothesis testing, with the former devising a multiple-testing procedure for highdimensional mediation hypotheses and the latter probing into large-scale hypothesis testing for causal mediation effects.Building on previous works, Perera et al. (2022) refined estimation and inference procedures for the high-dimensional linear mediation model, extending the foundations laid by Zhang et al. (2016).For an expansive understanding of highdimensional mediation analysis, please refer to two review papers by Zeng et al. (2021) and Zhang et al. (2022).
It is worth noting that many of the aforementioned studies primarily employed traditional mean-regression methods to analyze outcomes.However, these mean-centric techniques may not fully capture the mediation effects across the entire spectrum of the outcome distribution.For instance, Shen et al. (2014) introduced a quantile mediation model focusing on the outcome distribution, with a single mediator modeled through linear regression.Their study found that the mediation effects of walkability on body mass index (BMI) were substantially larger at the upper quantiles as opposed to the median or average.Essentially, individuals with elevated BMI levels might respond differently to interventions compared to those with average BMIs.Relying solely on mean regression could therefore offer an incomplete perspective of mediation effects across various outcome distribution segments.In a parallel vein, Bind et al. (2017) delved into the controlled direct and indirect effects of exposure across percentiles of both the mediator and outcome, taking into account longitudinal data.
It is necessary to point out that the models presented by Shen et al. (2014) and Bind et al. (2017) honed in exclusively on single-mediator cases.Given the prevalence of highdimensional data, there is a pressing need to develop innovative statistical methods for quantile mediation analysis that can accommodate high-dimensional mediators.The motivation for our work stems from a birth cohort study, examining 954 mother-newborn pairs from a primarily urban, low-income, multi-ethnic US birth cohort.A total of 865 859 CpG sites were measured for each newborn cord blood DNA sample.The study by Xu et al. (2021) revealed that newborns exposed to smoking weighed an average of 258 g less than their unexposed counterparts.Moreover, their findings suggested that the impact of maternal smoking on offspring birthweight is mediated by DNA methylation, as indicated by meanregression models.Probing into the mediation effects of DNA methylation specifically on low birth weight (LBW) or high birth weight (HBW) individuals, as opposed to those of average weight, introduces a riveting topic of mediation analysis.
As of our current understanding, there are no existing methods in published literature tailored for high-dimensional quantile mediation analysis.In this study, we introduce a novel quantile mediation model that incorporates high-dimensional mediators.Here, the outcome Y is characterized by a quantile regression model, while the mediators M k s are represented through a collection of linear regression models.The advantages of our proposed methodology are manifold.First, individuals experiencing extreme medical outcomes might be at a heightened risk for specific diseases.Traditional regression methods, grounded in averages, might overlook the influences of markers on outcomes at the tails of the distribution.Our quantile-focused mediation models could provide a comprehensive picture of mediating mechanisms that occur at the tails of outcome distributions.Second, we employ a screening procedure to reduce the dimension of mediators, which could considerably reduce the computational burden during the estimation phase.Notably, post-screening survivors predominantly include mediators with significant relevance, elevating the method's precision.Third, to identify active mediators, we introduce a statistic grounded in a joint significance test, ensuring mediator selection with desired confidence levels.
The remainder of this article is organized as follows: Section 2 unveils our newly developed quantile mediation model, accommodating high-dimensional mediators.We also present a three-step approach for pinpointing the most influential mediators.In Section 3, simulations are carried out to evaluate the performance of our proposed methodology.Section 4 comprises the application of our advanced quantile mediation model to a birth cohort study involving multiethnic US mother-newborn pairs.Finally, Section 5 encapsulates our concluding remarks and future directions.

Model and method
Let X, M ¼ ðM 1 ; . . .; M p Þ 0 , Z ¼ ðZ 1 ; . . .; Z q Þ 0 , and Y represent the observed exposure, mediators, covariates, and outcome, respectively, where p is the number of mediators and q is the dimension of Z. Denote by Q s ðYjZÞ the sth percentile of the conditional distribution of outcome Y given Z.Within the potential outcome framework, MðxÞ ¼ ðM 1 ðxÞ; . . .; M p ðxÞÞ 0 represents the potential mediators that would have been observed if X had been set to x, while Yðx; mÞ denotes the potential outcome that would have been observed if X had been set to x and M had been set to m. Denote by Q s fYðx; mÞjZg the sth percentile of the conditional distribution of the potential outcome Yðx; mÞ given Z.In the potential outcome framework, we propose a highdimensional linear quantile mediation model (Fig. 1), which is presented as follows: (1) where m ¼ ðm 1 ; . . .; m p Þ 0 is a high-dimensional vector of mediators; Z ¼ ðZ 1 ; . . .; Z q Þ 0 is a vector of confounding variables or covariates; c s is the "direct effect" of X on the sth quantile of Y, after adjusting for all mediators and covariates; a ¼ ða 1 ; . . .; a p Þ 0 is a vector of parameters relating the exposure to p mediating variables; and b s ¼ ðb 1;s ; . . .; b p;s Þ 0 is a vector of parameters relating the mediators to the sth quantile of the dependent variable adjusting for the effects of the exposure and covariates; f k 's and g are the parameters of covariates.In addition, c s and c k 's are the intercept terms; e k 's are zero-mean error terms.Let Q s fYðx; E½Mðx � ÞjZ�ÞjZg represent the sth quantile of the conditional distribution of the potential outcome that would have been observed if X were set to x and m were set to its conditional expected counterfactual value E½Mðx � ÞjZ�.Under the regular assumptions in the Supplementary Materials, we have derived that where We define the controlled direct effect (CDE) as The controlled indirect effect (CIE) is defined as (3) Note that the definitions of natural direct effect (NDE) and natural indirect effect (NIE) (in the Supplementary Material) cannot be applied in this context due to our specific focus on outcome quantiles.The CIE through the path X !M k !Q s ðYjX; M; ZÞ, as expressed by a k b k;s for k ¼ 1; . . .; p, is derived from (3).This implies that the sth conditional quantile of Y serves as the primary outcome of interest.The set X s ¼ fk : a k b k;s 6 ¼ 0; k ¼ 1; . . .; pg denotes the indices of statistically significant mediators at a given quantile level s 2 ð0; 1Þ.Here we assume the sparsity of active mediators with jX s j � p.By virtue of ( 1) and ( 2), we can reformulate it as a structural equation model (SEM) to evaluate the mediating effects of high-dimensional mediators on the outcome distribution: Assume that the observed samples are X i , M i ¼ ðM i1 ; . . .; M ip Þ 0 , and Y i , where i ¼ 1; . . .; n.Our main interest is to provide an estimated index set Xs with desirable confidence.To achieve this objective, we propose a three-step approach as outlined below.
Step 1. (Mediator screening).First, the mediators are standardized to ensure the regression coefficients are on the same scale.For k ¼ 1; . . .; p, we perform a series of marginal quantile models for the p mediators: Based on the marginal quantile screening idea of Li et al. (2023), we can identify a subset I s ¼ f1 � k � p : M k is among the top d ¼ 2½n= logðnÞ� mediators having the largest j bk jg, where bk is the estimate in marginal model (4).
The employed marginal quantile screening approach can be considered as a specific parametric case of Li et al. (2023), which has demonstrated a high probability that the nonzero b k;s 's belong to I s .In other words, we can reduce the dimension of mediators from p to d, while ensuring that active mediators are retained with a probability approaching to one.
Step 2. (Penalized estimate).Conduct variable selection for mediators fM k g k2I s by minimizing the penalty-based criterion, where h s ¼ ðb 0 I ;s ; c; c s ; g 0 s Þ 0 2 R dþqþ2 and b I ;s denotes a subvector of b with indices belonging to I .q s ðvÞ ¼ vfs−Iðv < 0Þg is the check function, and penð�Þ is the minimax concave penalty (MCP; Zhang 2010) with the following expression: Here k > 0 is the regularization parameter, and d > 0 determines the concavity of MCP.The MCP procedure for quantile regression has been implemented by Tan et al. (2022).In practical applications, the R package conquer can be utilized to solve Equation ( 5).Let S s ¼ fk : bk;s 6 ¼ 0g denote the index set of survived mediators after Step 2, where bk;s 's are the penalized estimates in (5).
Step 3. (Mediator selection).To conduct mediator selection, we take into account the refitted sub-model as follows: where M Ss denotes a sub-vector of M with indices belonging to S s .The parameter estimator bk;s and its standard error rb k in model ( 6) can be easily obtained by the R function rq().
The JS-type decision statistic is defined as Based on the above three-step approach, an estimated set of indices for significant mediators is derived as XJS ðsÞ ¼ fk : D JS k;s < 0:05; k 2 S s g, where D JS k;s is defined in (7).We provide a publicly accessible and user-friendly function qHIMA(), which can be conveniently accessed through the R package HIMA.
First, we generate random samples with models (8), ( 9) and (10) to evaluate the proposed method in Section 2. The quantile level is chosen as s ¼ 0.05, 0.25, 0.5, 0.75, and 0.95, respectively.In Tables 1 and 2, we report the simulation results on the mediation effects fa k b k;s g 6 k¼1 , including the bias (Bias) given by the difference of sample means of estimators and the true value, together with the sampling standard error (SSE).The estimation results for a k b k;s 's (k ¼ 7; . . .; p) are similar to that of a 6 b 6;s , so we omit the details.
The results in Tables 1 and 2 demonstrate that our method exhibits significantly smaller Bias and SSE compared to Shen et al. and Bind et al.Hence, the one-by-one mediator procedure is not applicable for estimating high-dimensional quantile mediation effects.To describe the accuracy of our mediator selection method, we report the following three results: Model size (MS): j X0 ðsÞj, where j X0 ðsÞj denotes the number of elements in X0 ðsÞ; True positive proportion (TPP): j X0 ðsÞ \ X 0 ðsÞj=jX 0 ðsÞj; False discovery proportion (FDP): j X0 ðsÞ n X 0 ðsÞj=j X0 ðsÞj, where j X0 ðsÞ n X 0 ðsÞj denotes the set difference of X0 ðsÞ and X 0 ðsÞ.
Tables 3 and 4 present the mediator selection results in terms of MS, TPP, and FDP.These results suggest that Shen et al. and Bind et al. tend to select a smaller model with lower TPP and FDP, i.e. the one-by-one mediator procedure is more conservative than our JS-based procedure.Moreover, all methods become better as the sample size n increases.
Overall, the proposed mediator selection method works well for high-dimensional quantile mediation model in practical applications.From the view of computational efficiency, our method takes approximately 245.34 seconds for one repetition with Case I and n ¼ 300, where the computation is implemented on a laptop with 8G memory.

A birth cohort study of mother-newborn pairs
Each year in the USA, maternal smoking affects over half a million pregnancies, resulting in fetal growth limitations.This is evident from the reduced birthweight and its subsequent long-term implications as noted by Wang et al. (2002).In this section, we apply our proposed methodology to a birth cohort study comprising mother-newborn pairs from a predominantly urban, low-income multi-ethnic birth cohort in the USA.
The study comprised a total of 954 mother-newborn pairs from the racially diverse Boston Birth Cohort (Pearson et al. 2022).Among the mothers, 165 (17.3%) had a history of smoking either before or during pregnancy.Newborns exposed to smoking exhibited an average birthweight reduction of 258 g compared to those without exposure (Xu et al. 2021).The DNA methylation (DNAm) markers of each newborn were measured using the Illumina Infinium MethylationEPIC BeadChip, which successfully produced DNAm profiles for 865 859 CpG sites (P ¼ 865 859).For each CpG site examined, a b-value ranging from 0 to 1 was reported.We aim to investigate the mediating role of DNA methylation (DNAm) markers in the association between smoking (binary exposure: 1 ¼ smoker and 0 ¼ non-smoker) and birthweight (outcome).We also adjust for confounding variables, including maternal age at delivery; parity (not including the index pregnancy): 0 versus 1 or more; maternal education: high school or less versus some college or more; maternal self-reported race: Black versus non-Black, where Black included self-reported Black (African American and Haitian) and non-Black included white and Hispanic; maternal alcohol consumption during pregnancy: never versus ever; maternal pre-pregnancy BMI; child's sex: female versus male; and gestational age at birth.We also adjust for estimated cord blood cell composition, which was calculated using the estimateCellCounts() function in the "minf" package, with seven cell types: CD4þ, T cells, B cells, monocytes, granulocytes, natural killer cells, and nucleated red blood cells.
We illustrate the distribution of infant birth weight in Fig. 2. We are interested in examining the mediation effects across the entire spectrum of the outcome distribution.In other words, we are looking beyond just the average outcome distribution, as is typically done using mean-regression methods.Specifically, we are keen on the effects on both the low birth weight (LBW) and high birth weight (HBW) samples.Based on the distribution, we can roughly categorize s 2 f0:2; 0:3g as the LBW group, s 2 f0:4; 0:5; 0:6g as the normal weight group, and s 2 f0:7; 0:8g as the HBW group.The results from our method for the chosen mediators are summarized in Table 5.In contrast, with the same dataset, the methods by Shen et al. (2014) and Bind et al. (2017) cannot select any significant mediators after adjusting for multiple comparisons.Under s 2 f0:2; 0:3g (i.e.LBW group), we identified three CpG sites (cg25325512, cg07814318, and cg14541773) as significant mediators, while under s 2  High-dimensional quantile mediation analysis f0:4; 0:5; 0:6g (i.e.regular weight group), one CpG site (cg07814318) is consistently identified to be significant mediators.Studies have shown that the effect of maternal smoking on birth weight is partly mediated by the methylation of cg25325512 (PIM1), which contributes to a decrease in birth weight.Several studies have also identified that maternal smoking may influence DNA methylation at the KLF13 gene region in offsprings.This association may be dose dependent, with stronger effects observed with higher smoking intensity and duration.Additionally, the studies indicate that the association between KLF13 methylation and smoking may have implications for child health and the developmental origins of the metabolic syndrome.Under s 2 f0:7; 0:8g (i.e.HBW group), no significant mediator is identified under s ¼ 0:7, but under s ¼ 0:8, our method identifies CpG sites cg04411342 and cg05575921 as significant mediators.cg05575921 (within gene AHRR) exhibits a significant mediating effect in the higher quantiles (HBW), while no mediating effect is observed in the lower quantiles (LBW) and median.Our previous study has identified CpG site cg05575921 as a significant mediator of maternal smoking and birthweight (Xu et al. 2021).A low level of methylation at the cg05575921 locus in the AHRR gene has been robustly associated with smoking (Grieshober et al. 2020).This hypomethylation is thought to mediate the association between maternal smoking and metabolic profiles in children.Functionally, cigarette smoke contains toxic components, such as polycyclic aromatic hydrocarbons, which can induce aryl hydrocarbon receptor (AHR)-mediated AHRR expression and methylation (Tantoh et al. 2020).These results demonstrate the capability of our method to thoroughly capture the complex mediating effects across different sections of the outcome distribution.
Per the suggestion of a reviewer, in Supplementary Table SA, we present a summary of the top five CpGs identified by Shen et al. and Bind et al. in their application to the real data.When we compare these results with those obtained through our method in Table 5, numerous discrepancies are evident.Bind et al. (2017); "Proposed" denotes our method.
For instance, many of the CpGs identified by our method, such as cg25325512, cg07814318, cg04411342, do not appear in the top five lists of Shen et al. and Bind et al.However, there are some points of agreement.At s ¼ 0:3, cg14541773 is selected both by our method and by Shen et al.; cg05575921 is included at s 2 f0:4; 0:5g by Shen et al. and Bind et al., while our method identifies cg05575921 at s ¼ 0:8.

Concluding remarks
This study introduced a novel high-dimensional quantile mediation model.A three-step procedure was employed to identify active mediators.Simulations and real data application were conducted to demonstrate the practical utility of our approach.The R function qHIMA() was developed to facilitate the implementation of our method in practice, providing a user-friendly interface.Assumption (C.4) highlights that while mediators may be correlated, they do not have causal relationships with each other.This is a crucial assumption for the validity of the causal interpretations of our Conditional Direct Effect (CDE) and Conditional Indirect Effect (CIE), even though exploring the causal links between the mediators themselves is not our focus.
There are several topics that will be studied in the future.First, the present study sheds light on high-dimensional quantile mediation analysis, which has the potential for further expansion to include survival outcomes (Zhang et al. 2021c) and longitudinal data (Bind et al. 2017).Second, the proposed method incorporates linear models for the mediators and a quantile model for the outcome.It is desirable to consider using quantile models for both mediators and outcomes under our framework.Third, the development of a robust high-dimensional mediation analysis procedure becomes intriguing when potential outliers are present in the mediators and outcomes.Fourth, the investigation of interactions between exposure and high-dimensional mediators poses significant challenges, necessitating further endeavors to address this issue.Fifth, we have imposed a sparsity condition for active mediators when applying our method.However, the  a s 2 f0:2; 0:3g is referred to as the low birth weight group; s 2 f0:8g is the high birth weight group; Q s denotes the sth quantile of birth weight data (in grams).
High-dimensional quantile mediation analysis approach to handling quantile mediation effects remains uncertain in situations where there are multiple true mediators that exceed the sample size.

Figure 2 .
Figure 2. The histogram of infant birth weight (in grams).
bÞ denotes the maximum of a and b; P a k ¼ 2f1−U Nð0;1Þ ðjâ k j=r a k Þg; âk is the ordinary least squares estimate with its estimated standard error ra k , and P b k;s ¼ 2f1−U Nð0;1Þ ðj bk;s j=r b k;s Þg.
minðd s P JS k;s ; 1Þ; k 2 S s ; (7) where d s is the cardinality of set S s ; P JS k;s ¼ maxðP a k ; P b k ;s Þ; maxða;

Table 1 .
The (Bias, SSE) of estimation for mediation effects in Case I.

;s Shen et al. a Bind et al. b Proposed c Shen et al. Bind et al. Proposed
a "Shen et al" denotes the method of Shen et al. (2014); b "Bind et al." denotes the method of Bind et al. (2017); c "Proposed" denotes our method.

Table 2 .
The (Bias, SSE) of estimation for mediation effects in Case II.

;s Shen et al. a Bind et al. b Proposed c Shen et al. Bind et al. Proposed
a "Shen et al." denotes the method of Shen et al. (2014); b "Bind et al." denotes the method of Bind et al. (2017); c "Proposed" denotes our method.

Table 4 .
The performance of mediator selection in Case II.

Table 5 .
Summary results of selected CpGs in the birth cohort study.